arXiv.org e-Print Archive

UCL Discovery

Practical Evaluation of Lempel-Ziv-78 and Lempel-Ziv-Welch Tries

Author: A Poyias
D Arroyuelo
D Lemire
G Marsaglia
GH Gonnet
H Bannai
H Luan
J Fischer
J Jansson
J Kärkkäinen
J Ziv
JA Feldman
JG Cleary
K Chung
L Carter
P Tchebychev
RM Karp
RM Robinson
TA Welch
Y Nakashima
Publication date: 09/06/2017

We present the first thorough practical study of the Lempel-Ziv-78 and the Lempel-Ziv-Welch computation based on trie data structures. With a careful selection of trie representations, we can beat well-tuned popular trie data structures like Judy, m-Bonsai, or Cedar.
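
As a rough illustration of what "LZ78 computation based on a trie" means, the sketch below factorizes a string with a plain dictionary-based trie. It is a minimal teaching example, not one of the trie representations evaluated in the paper; the function names `lz78_factorize` and `lz78_decode` and the `(parent, char)` factor encoding are assumptions made for this sketch.

```python
def lz78_factorize(text):
    """Return LZ78 factors as (parent_node, char) pairs.

    Assumption for this sketch: the trie is a list of dicts, where
    trie[node] maps a character to a child node id and node 0 is the root.
    """
    trie = [{}]
    factors = []
    node = 0
    for ch in text:
        child = trie[node].get(ch)
        if child is not None:
            # Keep extending the current phrase along an existing edge.
            node = child
        else:
            # New phrase: add a leaf below the current node and emit a factor.
            trie.append({})
            trie[node][ch] = len(trie) - 1
            factors.append((node, ch))
            node = 0
    if node != 0:
        # The input ended inside a known phrase; emit it with an empty char.
        factors.append((node, ""))
    return factors


def lz78_decode(factors):
    """Rebuild the text from (parent_node, char) factors."""
    phrases = [""]  # phrases[i] is the string spelled by trie node i
    out = []
    for parent, ch in factors:
        phrase = phrases[parent] + ch
        phrases.append(phrase)
        out.append(phrase)
    return "".join(out)
```

For example, `lz78_factorize("abababab")` splits the input into the phrases `a`, `b`, `ab`, `aba`, `b`. The per-character child lookup is the operation whose cost depends on the trie representation chosen.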

Fast and robust multiple sequence alignment with phylogeny-aware gap placement

Author: A Biegert
A Löytynoja
A Viterbi
Adam M Szalkowski
AM Altenhoff
AM Szalkowski
B Paten
C Dessimoz
C Grasso
C Lee
D Robinson
DA Dalquen
G Gonnet
GH Gonnet
GW Stuart
J Felsenstein
JD Thompson
JL Thorne
JM Sauder
K Katoh
M Anisimova
M Kimura
O Gascuel
O Gotoh
R Durbin
RC Edgar
S Pascarella
S Whelan
SA Benner
SB Needleman
Publication venue: 'Springer Science and Business Media LLC'

Gene fusions and gene duplications: relevance to genomic annotation and functional analysis

Author: A Bateman
A Dautry-Varsat
A Maruya
B Labedan
C Vogel
CF Higgins
F Titgemeyer
GH Gonnet
GH Thomas
H Salgado
I Saint-Girons
IP Crawford
J Gough
JA Gerlt
JD Glasner
K Fukami-Kobayashi
LA Nahum
M El Ghachi
M Madera
M Riley
MH Serres
MY Galperin
NB Vartak
P Liang
PD Karp
PJ Piggot
R Jaggi
RM Schwartz
RR Chaudhuri
S Sundararaj
SB Needleman
SF Altschul
SY Yang
TF Smith
WR Gilks
Y Fujita
Publication venue: BioMed Central
Publication date: 01/01/2005

BACKGROUND: Escherichia coli, a model organism, provides information for annotation of other genomes. Our analysis of its genome has shown that proteins encoded by fused genes need special attention. Such composite (multimodular) proteins consist of two or more components (modules) encoding distinct functions. Multimodular proteins have been found to complicate both annotation and generation of sequence-similar groups. Previous work overstated the number of multimodular proteins in E. coli. This work corrects the identification of modules by including sequence information from proteins in 50 sequenced microbial genomes. RESULTS: Multimodular E. coli K-12 proteins were identified from sequence similarities between their component modules and non-fused proteins in 50 genomes and from the literature. We found 109 multimodular proteins in E. coli containing either two or three modules. Most modules had standalone sequence relatives in other genomes. The separated modules together with all the single (un-fused) proteins constitute the sum of all unimodular proteins of E. coli. Pairwise sequence relationships among all E. coli unimodular proteins generated 490 sequence-similar, paralogous groups. Groups ranged in size from 92 to 2 members and had varying degrees of relatedness among their members. Some E. coli enzyme groups were compared to homologs in other bacterial genomes. CONCLUSION: The deleterious effects of multimodular proteins on annotation and on the formation of groups of paralogs are emphasized. To improve annotation results, all multimodular proteins in an organism should be detected and, when known, each function should be connected with its location in the sequence of the protein. When transferring functions by sequence similarity, alignment locations must be noted, particularly when alignments cover only part of the sequences, in order to enable transfer of the correct function. Separating multimodular proteins into module units makes it possible to generate protein groups related by both sequence and function, avoiding mixing of unrelated sequences. Organisms differ in sizes of groups of sequence-related proteins. A sample comparison of orthologs to selected E. coli paralogous groups correlates with known physiological and taxonomic relationships between the organisms.

Woods Hole Open Access Server

ASH structure alignment package: Sensitivity and selectivity in domain classification

Author: A Prlic
AG Murzin
CA Orengo
Daron M Standley
DM Standley
E Krissinel
G Vogt
GH Gonnet
Haruki Nakamura
Hiroyuki Toh
J Zhu
K Nakai
K Tomii
L Holm
M Levitt
ML Sierk
R Kolodny
S Henikoff
S Kawashima
S Subbiah
SJ Mason
Publication venue: BioMed Central
Publication date: 01/01/2007

Grammar-based distance in progressive multiple sequence alignment

Author: AY Mitrophanov
C Notredame
CB Do
David J Russell
DJ Lipman
GH Gonnet
Hasan H Otu
HH Otu
J Stoye
J Ziv
JD Thompson
K Katoh
Khalid Sayood
MO Albertson
P Clote
R Durbin
RC Edgar
S Henikoff
S Sze
SB Needleman
VD Gusev
Publication venue: BioMed Central
Publication date: 01/01/2008

Background: We propose a multiple sequence alignment (MSA) algorithm and compare the alignment quality and execution time of the proposed algorithm with those of existing algorithms. The proposed progressive alignment algorithm uses a grammar-based distance metric to determine the order in which biological sequences are to be pairwise aligned. The progressive alignment proceeds by pairwise aligning new sequences with an ensemble of the sequences previously aligned. Results: The performance of the proposed algorithm is validated via comparison to popular progressive multiple alignment approaches, ClustalW and T-Coffee, and to the more recently developed algorithms MAFFT, MUSCLE, Kalign, and PSAlign using the BAliBASE 3.0 database of amino acid alignment files and a set of longer sequences generated by Rose software. The proposed algorithm has successfully built multiple alignments comparable to other programs with significant improvements in running time. The results are especially striking for large datasets. Conclusion: We introduce a computationally efficient progressive alignment algorithm using a grammar-based sequence distance, particularly useful in aligning large datasets.

DigitalCommons@University of Nebraska

Harvard University - DASH

Surprising results on phylogenetic tree building methods based on molecular sequences

Author: A Loytynoja
A Schneider
A Stamatakis
AC Roth
AM Altenhoff
C Dessimoz
C Do
C Lee
CC McGeoch
DF Robinson
DL Swofford
DT Jones
E Sayers
E Zuckerkandl
F Sievers
G Laver
Gaston H Gonnet
GH Gonnet
GM Cannarozzi
J Felsenstein
J Hey
J Marmur
J Thompson
K Katoh
L Stuyver
M Anisimova
M dos Reis
M Sanderson
M Steel
M Van Oven
N Saitou
O Gascuel
P Soltis
R Desper
S Guindon
S Hedges
S Le
S Whelan
SA Benner
SB Needleman
TF Smith
W Fitch
WM Fitch
Y Lin
Z Yang
Publication venue: 'Springer Science and Business Media LLC'

Optimizing substitution matrix choice and gap parameters for sequence alignment

Author: CB Do
CN Dewey
D Gusfield
DT Jones
E Kim
G Blackshields
GA Price
GH Gonnet
I Van Walle
J Flannick
J Kececioglu
J Pei
JD Thompson
JG Henikoff
K Katoh
M Box
MA Larkin
MO Dayhoff
MP Styczynski
MS Waterman
O Chapelle
RC Edgar
Robert C Edgar
S Henikoff
T Lassmann
T Muller
TM Phuong
Publication venue: BioMed Central
Publication date: 01/01/2009

Background: While substitution matrices can readily be computed from reference alignments, it is challenging to compute optimal or approximately optimal gap penalties. It is also not well understood which substitution matrices are the most effective when alignment accuracy is the goal rather than homolog recognition. Here a new parameter optimization procedure, POP, is described and applied to the problems of optimizing gap penalties and selecting substitution matrices for pair-wise global protein alignments. Results: POP is compared to a recent method due to Kim and Kececioglu and found to achieve from 0.2% to 1.3% higher accuracies on pair-wise benchmarks extracted from BALIBASE. The VTML matrix series is shown to be the most accurate on several global pair-wise alignment benchmarks, with VTML200 giving best or close to the best performance in all tests. BLOSUM matrices are found to be slightly inferior, even with the marginal improvements in the bug-fixed RBLOSUM series. The PAM series is significantly worse, giving accuracies typically 2% less than VTML. Integer rounding is found to cause slight degradations in accuracy. No evidence is found that selecting a matrix based on sequence divergence improves accuracy, suggesting that the use of this heuristic in CLUSTALW may be ineffective. Using VTML200 is found to improve the accuracy of CLUSTALW by 8% on BALIBASE and 5% on PREFAB. Conclusion: The hypothesis that more accurate alignments of distantly related sequences may be achieved using low-identity matrices is shown to be false for commonly used matrix types. Source code and test data are freely available from the author's web site at http://www.drive5.com/pop.
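
To make concrete how a substitution matrix and a gap penalty jointly determine a pair-wise global alignment score, here is a minimal Needleman-Wunsch scoring sketch. It is not POP and does not use VTML, BLOSUM, or PAM values: the toy scoring function `toy_subst` (+1 match, -1 mismatch), the linear (non-affine) gap penalty, and the name `global_alignment_score` are all assumptions chosen for brevity.

```python
def global_alignment_score(a, b, subst, gap):
    """Score of the best global alignment of a and b (Needleman-Wunsch).

    subst(x, y) returns the substitution score for aligning x with y.
    gap is the (negative) score charged per gap position; a linear gap
    model is assumed here for simplicity.
    """
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            cur.append(max(
                prev[j - 1] + subst(a[i - 1], b[j - 1]),  # align two residues
                prev[j] + gap,                            # gap in b
                cur[j - 1] + gap,                         # gap in a
            ))
        prev = cur
    return prev[-1]


def toy_subst(x, y):
    """Hypothetical stand-in for a real substitution matrix."""
    return 1 if x == y else -1
```

With these toy parameters, `global_alignment_score("ACD", "AD", toy_subst, -2)` evaluates to 0 (two matches and one gap). Changing `gap` or swapping in a different `subst` changes which alignment is optimal, which is why the choice of these parameters affects alignment accuracy.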

Retrieving sequences of enzymes experimentally characterized but erroneously annotated: the case of the putrescine carbamoyltransferase

Author: A Bairoch
A Sekowska
B Barcelona-Andres
B Labedan
B Wargnies
C Tricot
C Vander Wauven
GH Gonnet
I Paulsen
I Schomburg
J Felsenstein
JA Gerlt
JP Simon
L Grivell
M Kanehisa
M Zuniga
PC Babbitt
PD Karp
R Apweiler
R Cunin
RJ Roon
S Dashuang
SE Brenner
T Janowitz
TA Hall
The Gene Ontology Consortium
V Stalon
Y Nakada
Publication venue: BioMed Central
Publication date: 01/01/2004

BACKGROUND: Annotating genomes remains a hazardous task. Mistakes or gaps in such a complex process may occur when relevant knowledge is ignored, whether lost, forgotten or overlooked. This paper exemplifies an approach which could help to resuscitate such meaningful data. RESULTS: We show that a set of closely related sequences which have been annotated as ornithine carbamoyltransferases are actually putrescine carbamoyltransferases. This demonstration is based on the following points: (i) use of enzymatic data which had been overlooked, (ii) rediscovery of a short NH(2)-terminal sequence that allows a wrongly annotated ornithine carbamoyltransferase to be reannotated as a putrescine carbamoyltransferase, (iii) identification of conserved motifs that distinguish unambiguously between the two kinds of carbamoyltransferases, and (iv) comparative study of the gene context of these different sequences. CONCLUSIONS: We explain why this specific case of misannotation had not yet been described and draw attention to the fact that analogous instances must be rather frequent. We urge special caution when high sequence similarity is coupled with an apparent lack of biochemical information. Moreover, from the point of view of genome annotation, proteins which have been studied experimentally but are not correlated with sequence data in current databases qualify as "orphans", just as unassigned genomic open reading frames do. The strategy we used in this paper to bridge such gaps in knowledge could work whenever it is possible to collect a body of facts about experimental data, homology, unnoticed sequence data, and accurate information about gene context.

DI-fusion

Joint Evolutionary Trees: A Large-Scale Method To Predict Protein Interfaces Based on Sequence Sampling

Author: A Armon
A Prlic
Alessandra Carbone
BW Matthews
CA Innis
CJ Tsai
CT Porter
DR Caffrey
E Kanamori
ELL Sonnhammer
G Cheng
GH Gonnet
H Chen
I Mihalek
JA Studier
JR Bradford
Ladislas A. Trojan
Michael Levitt
O Lichtarge
P Chakrabarti
Richard Lavery
RP Bahadur
S Henikoff
S Jones
S Madabushi
S Miller
SF Altschul
SJ Hubbard
Sophie Sacquin-Mora
SS Negi
Stefan Engelen
T Pupko
W Humphrey
WSJ Valdar
Y Ofran
ZJ Hu
Publication venue: Public Library of Science
Publication date: 01/01/2009

The Joint Evolutionary Trees (JET) method detects protein interfaces, the core residues involved in the folding process, and residues susceptible to site-directed mutagenesis and relevant to molecular recognition. The approach, based on the Evolutionary Trace (ET) method, introduces a novel way to treat evolutionary information. Families of homologous sequences are analyzed through a Gibbs-like sampling of distance trees to reduce effects of erroneous multiple alignment and impacts of weakly homologous sequences on distance tree construction. The sampling method makes sequence analysis more sensitive to functional and structural importance of individual residues by avoiding effects of the overrepresentation of highly homologous sequences and improves computational efficiency. A carefully designed clustering method is parametrized on the target structure to detect and extend patches on protein surfaces into predicted interaction sites. Clustering takes into account residues' physical-chemical properties as well as conservation. Large-scale application of JET requires the system to be adjustable for different datasets and to guarantee predictions even if the signal is low. Flexibility was achieved by a careful treatment of the number of retrieved sequences, the amino acid distance between sequences, and the selective thresholds for cluster identification. An iterative version of JET (iJET) that guarantees finding the most likely interface residues is proposed as the appropriate tool for large-scale predictions. Tests are carried out on the Huang database of 62 heterodimer, homodimer, and transient complexes and on 265 interfaces belonging to signal transduction proteins, enzymes, inhibitors, antibodies, antigens, and others. A specific set of proteins chosen for their special functional and structural properties illustrate JET behavior on a large variety of interactions covering proteins, ligands, DNA, and RNA. 
JET is compared at a large scale to ET and to Consurf, Rate4Site, siteFiNDER|3D, and SCORECONS on specific structures. A significant improvement in performance and computational efficiency is shown.

HAL-Inserm